Abstract. One of the main applications of balance indices is in tests of null models of
evolutionary processes. The knowledge of an exact formula for a statistic of a balance index, 
holding for any number n of leaves, is necessary in order to use this statistic in tests of this 
kind involving trees of any size. In this paper we obtain exact formulas for the variance 
under the Yule model of the Sackin index and the total cophenetic index of binary rooted 
phylogenetic trees with n leaves. We also obtain the covariance of these indices. 

1 Introduction 

One of the most thoroughly studied properties of the topology of phylogenetic trees is 
their symmetry, that is, the degree to which both children of each internal node tend to 
have the same number of descendant taxa. The symmetry of a tree is usually quantified 
by means of balance indices. Many such indices have been proposed so far in the literature
[5, Chap. 33]. One of the most popular is Sackin's index $S$ [15], which is defined as the
sum of the depths of all leaves in the tree. We have recently proposed an extension of
Sackin's index, the total cophenetic index $\Phi$ [11]: the sum, over all pairs of different
leaves of the tree, of the depth of their lowest common ancestor. The main advantages
of $\Phi$ over $S$ are that it has a larger range of values and a smaller probability of ties.
Moreover, $\Phi$ retains other good properties of $S$: it makes sense for not necessarily fully
resolved phylogenetic trees (unlike another popular balance index, Colless' [1]), it can be 
computed in linear time, and the statistical properties of its distribution of values can 
be studied under different stochastic models of evolution, such as the Yule [7,21]
and the uniform [3,14,18] models. This last property is relevant because one of the main
applications of balance indices is their use as tools to test stochastic models of evolution 

[12,16].

Exact formulas for the expected values of $S$ and $\Phi$ on the space $\mathcal{T}_n$ of fully resolved
rooted phylogenetic trees with $n$ leaves have been published for the Yule and the uniform
models. In particular, if we denote by $H_n$ the $n$-th harmonic number, i.e.,

\[ H_n = \sum_{i=1}^{n} \frac{1}{i}, \]

these expected values under the Yule model on $\mathcal{T}_n$ are, respectively,

\[
\begin{aligned}
E_Y(S_n) &= 2n(H_n - 1) && \text{[8]} \\
E_Y(\Phi_n) &= n(n-1) - 2n(H_n - 1) && \text{[11]}
\end{aligned}
\]

As we already pointed out [11], these formulas imply that the expected value under
the Yule model of the sum $\overline{\Phi} = S + \Phi$ on $\mathcal{T}_n$ is $n(n-1)$, a much simpler expression than
those for $E_Y(S_n)$ or $E_Y(\Phi_n)$. This index $\overline{\Phi}$ has the same good properties as $\Phi$, but the
formulas for its statistics under the Yule model tend to be simpler than the corresponding
formulas for other indices. We shall find here another example of this fact: the variance.

The goal of this paper is to provide exact formulas for the variances of $S$, $\Phi$ and $\overline{\Phi}$,
and the covariances between $S$ and the other two, on $\mathcal{T}_n$ and under the Yule model. The
variance of $S$ on $\mathcal{T}_n$ under this model was known so far only for its limit distribution
when $n \to \infty$ [1], being

\[ \sigma_Y^2(S_n) \sim \Big(7 - \frac{2\pi^2}{3}\Big) n^2. \]

Also, Rogers [13] found a recursive formula for the moment-generating functions of $S$
under this model, which allowed him to compute $\sigma_Y^2(S_n)$ for $n = 1, \ldots, 50$, but he did
not obtain any explicit exact formula for this variance.

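In this paper we obtain the following exact formulas for these variances:

\[
\begin{aligned}
\sigma_Y^2(S_n) &= 7n^2 - 4n^2 H_n^{(2)} - 2nH_n - n \\
\sigma_Y^2(\Phi_n) &= \frac{1}{12}(n^4 - 10n^3 + 131n^2 - 2n) - 4n^2 H_n^{(2)} - 6nH_n \\
\sigma_Y^2(\overline{\Phi}_n) &= 2\binom{n}{4}
\end{aligned}
\]

where $H_n^{(2)} = \sum_{i=1}^{n} 1/i^2$. We also obtain the following exact formulas for the covariances,
under the Yule model, of $S_n$ with $\Phi_n$ and $\overline{\Phi}_n$:

\[
\begin{aligned}
\mathrm{cov}_Y(S_n, \Phi_n) &= 4n(nH_n^{(2)} + H_n) + \frac{1}{6} n(n^2 - 51n + 2) \\
\mathrm{cov}_Y(S_n, \overline{\Phi}_n) &= 2nH_n + \frac{1}{6} n(n^2 - 9n - 4)
\end{aligned}
\]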

These formulas are valid for any number n of leaves, and therefore they can be used in 
a meaningful way in tests involving trees of any size. The proofs of all these formulas 
consist of elementary, although technically involved, algebraic computations. 
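
To make the closed formulas concrete, the following minimal Python sketch evaluates them with exact rational arithmetic. The function names are ours and purely illustrative; the formulas coded are the three variances and two covariances as we read them from the statements in this paper, with the variance of $\overline{\Phi}_n$ taken as $2\binom{n}{4}$.

```python
from fractions import Fraction
from math import comb


def harmonic(n, power=1):
    """Generalized harmonic number: sum of 1/i**power for i = 1..n (exact)."""
    return sum(Fraction(1, i ** power) for i in range(1, n + 1))


def var_sackin(n):
    """Variance of the Sackin index S_n under the Yule model."""
    return 7 * n**2 - 4 * n**2 * harmonic(n, 2) - 2 * n * harmonic(n) - n


def var_cophenetic(n):
    """Variance of the total cophenetic index Phi_n under the Yule model."""
    poly = Fraction(n**4 - 10 * n**3 + 131 * n**2 - 2 * n, 12)
    return poly - 4 * n**2 * harmonic(n, 2) - 6 * n * harmonic(n)


def var_sum(n):
    """Variance of the sum S_n + Phi_n under the Yule model."""
    return 2 * comb(n, 4)


def cov_sackin_cophenetic(n):
    """Covariance of S_n and Phi_n under the Yule model."""
    poly = Fraction(n * (n**2 - 51 * n + 2), 6)
    return 4 * n * (n * harmonic(n, 2) + harmonic(n)) + poly


def cov_sackin_sum(n):
    """Covariance of S_n and S_n + Phi_n under the Yule model."""
    return 2 * n * harmonic(n) + Fraction(n * (n**2 - 9 * n - 4), 6)


if __name__ == "__main__":
    for n in range(4, 10):
        print(n, float(var_sackin(n)), float(var_cophenetic(n)),
              var_sum(n), float(cov_sackin_cophenetic(n)))
```

A useful internal consistency check is that `var_sum(n)` equals `var_sackin(n) + var_cophenetic(n) + 2 * cov_sackin_cophenetic(n)`, as it must for the variance of a sum.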

The rest of this paper is organized as follows. In a first section on Preliminaries
we gather some notations and conventions on phylogenetic trees and some lemmas on
probabilities of trees under the Yule model and on harmonic numbers. In the next section,
we establish a recursive formula for the expected value under the Yule model of the square
of a balance index satisfying a certain kind of recursion (a recursive shape index [10])
that lies at the basis of all our computations. Then, we devote a series of sections to
computing the variances of $S$, $\Phi$, $\overline{\Phi}$ and the covariances of $S$ with $\Phi$ and $\overline{\Phi}$, respectively.
These sections consist of long and tedious algebraic computations, of no interest
beyond the fact that they prove the formulas announced above. We close the paper with
a section on Conclusions and Discussion.

2 Preliminaries 

2.1 Phylogenetic trees 

In this paper, by a phylogenetic tree on a set $S$ of taxa we mean a binary rooted tree with
its leaves bijectively labeled in the set $S$. We shall always understand a phylogenetic tree
as a directed graph, with its arcs pointing away from the root. To simplify the language,
we shall always identify a leaf of a phylogenetic tree with its label. We shall also use the
term phylogenetic tree with $n$ leaves to refer to a phylogenetic tree on the set $\{1, \ldots, n\}$.
We shall denote by $\mathcal{T}(S)$ the set of isomorphism classes of phylogenetic trees on a set
$S$ of taxa, and by $\mathcal{T}_n$ the set $\mathcal{T}(\{1, \ldots, n\})$ of isomorphism classes of phylogenetic trees
with $n$ leaves. We shall denote by $V_{int}(T)$ the set of internal nodes of a phylogenetic tree
$T$.

Whenever there exists a path from $u$ to $v$ in a phylogenetic tree $T$, we shall say
that $v$ is a descendant of $u$ and that $u$ is an ancestor of $v$. The lowest common ancestor
$LCA_T(u, v)$ of a pair of nodes $u, v$ of a phylogenetic tree $T$ is the unique common ancestor
of them that is a descendant of every other common ancestor of them.

The depth $\delta_T(v)$ of a node $v$ in $T$ is the length (in number of arcs) of the unique path
from the root $r$ of $T$ to $v$. The cophenetic value [17] of a pair of leaves $i, j$ is the depth
of their lowest common ancestor:

\[ \varphi_T(i,j) = \delta_T(LCA_T(i,j)). \]

To simplify the notations at some points, we shall also write $\varphi_T(i,i)$ to denote the depth
$\delta_T(i)$ of a leaf $i$.

Given two phylogenetic trees $T, T'$ on disjoint sets of taxa $S, S'$, respectively, their
tree-sum is the tree $T\,\widehat{\ }\,T'$ on $S \cup S'$ obtained by connecting the roots of $T$ and $T'$ to a (new)
common root. Every tree with $n$ leaves is obtained as $T_k\,\widehat{\ }\,T'_{n-k}$, for some $1 \leq k \leq n-1$,
some subset $S_k \subsetneq \{1, \ldots, n\}$ with $k$ elements, some tree $T_k$ on $S_k$ and some tree $T'_{n-k}$ on
$S_k^c = \{1, \ldots, n\} \setminus S_k$; actually, every tree $T$ with $n$ leaves is obtained in this way twice.

The Yule, or Equal-Rate Markov, model of evolution [7,21] is a stochastic model of
phylogenetic trees' growth. It starts with a node, and at every step a leaf is chosen
randomly and uniformly and it is split into two leaves. Finally, the labels are assigned
randomly and uniformly to the leaves once the desired number of leaves is reached. This
corresponds to a model of evolution where, at each step, each currently extant species can
give rise, with the same probability, to two new species. Under this model of evolution,
different trees with the same number of leaves may have different probabilities. More
specifically, if $T$ is a phylogenetic tree with $n$ leaves, and for every internal node $z$ we
denote by $\kappa_T(z)$ the number of its descendant leaves, then the probability of $T$ under
the Yule model is [2,19]

\[ P_Y(T) = \frac{2^{n-1}}{n!} \prod_{v \in V_{int}(T)} \frac{1}{\kappa_T(v) - 1}. \]

The following easy lemma on the probability of a tree-sum under the Yule model will be 
used in our computations. 

Lemma 1. Let $\emptyset \neq S_k \subsetneq \{1, \ldots, n\}$ with $|S_k| = k$, let $T_k \in \mathcal{T}(S_k)$ and $T'_{n-k} \in
\mathcal{T}(\{1, \ldots, n\} \setminus S_k)$. Then

\[ P_Y(T_k\,\widehat{\ }\,T'_{n-k}) = \frac{2}{(n-1)\binom{n}{k}}\, P_Y(T_k)\, P_Y(T'_{n-k}). \]

Proof. This equality is a direct consequence of the explicit probabilities $P_Y(T_k)$,
$P_Y(T'_{n-k})$ and $P_Y(T_k\,\widehat{\ }\,T'_{n-k})$ and the fact that $V_{int}(T_k\,\widehat{\ }\,T'_{n-k})$ is the disjoint union of
$V_{int}(T_k)$, $V_{int}(T'_{n-k})$ and the root $r$ of $T_k\,\widehat{\ }\,T'_{n-k}$. □
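
As an illustration, the product formula for the Yule probability and the tree-sum relation of Lemma 1 can be checked on a small example. This is only a sketch: the encoding of a tree as nested pairs with integer leaf labels and the function names are our own assumptions, and the relation checked is $P_Y(T_k\,\widehat{\ }\,T'_{n-k}) = \frac{2}{(n-1)\binom{n}{k}} P_Y(T_k) P_Y(T'_{n-k})$.

```python
from fractions import Fraction
from math import comb, factorial


def n_leaves(tree):
    """Number of leaves of a tree given as nested pairs; a leaf is an int label."""
    if isinstance(tree, int):
        return 1
    left, right = tree
    return n_leaves(left) + n_leaves(right)


def internal_sizes(tree):
    """Yield the number of descendant leaves of every internal node."""
    if isinstance(tree, int):
        return
    left, right = tree
    yield n_leaves(tree)
    yield from internal_sizes(left)
    yield from internal_sizes(right)


def yule_probability(tree):
    """Yule probability 2^(n-1)/n! times the product of 1/(kappa(v) - 1)."""
    n = n_leaves(tree)
    prob = Fraction(2 ** (n - 1), factorial(n))
    for kappa in internal_sizes(tree):
        prob /= (kappa - 1)
    return prob


# Example: T_k = ((1,2),3) on {1,2,3} and T' = (4,5) on {4,5}, so n = 5, k = 3.
t_k = ((1, 2), 3)
t_rest = (4, 5)
tree_sum = (t_k, t_rest)
n, k = 5, 3
lhs = yule_probability(tree_sum)
rhs = Fraction(2, (n - 1) * comb(n, k)) * yule_probability(t_k) * yule_probability(t_rest)
print(lhs, rhs)
```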

2.2 Harmonic numbers 

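For every $n \geq 1$, let

\[ H_n = \sum_{i=1}^{n} \frac{1}{i}, \qquad H_n^{(2)} = \sum_{i=1}^{n} \frac{1}{i^2}. \]

Let, moreover, $H_0 = H_0^{(2)} = 0$. $H_n$ is called the $n$-th harmonic number, and $H_n^{(2)}$, the
generalized harmonic number of power 2. It is known (see, for instance, [6, p. 264]) that

\[ H_n = \ln(n) + \gamma + \frac{1}{2n} + O\Big(\frac{1}{n^2}\Big), \qquad H_n^{(2)} = \frac{\pi^2}{6} - \frac{1}{n} + \frac{1}{2n^2} + O\Big(\frac{1}{n^3}\Big), \]

where $\gamma$ is Euler's constant.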

The following identities will be used in the proofs of our main results. 

Lemma 2. For every $n \geq 2$:

(1) $\sum_{k=1}^{n-1} H_k = n(H_n - 1)$

(2) $\sum_{k=1}^{n-1} kH_k = \frac{1}{4} n(n-1)(2H_n - 1)$

(3) $\sum_{k=1}^{n-1} k^2 H_k = \frac{1}{36} n(n-1)\big((12n - 6)H_n - 4n - 1\big)$

(4) $\sum_{k=1}^{n-1} \frac{H_k}{k+1} = \frac{1}{2}\big(H_n^2 - H_n^{(2)}\big)$

(5) $\sum_{k=1}^{n-1} H_k^2 = nH_n^2 - (2n+1)H_n + 2n$

(6) $\sum_{k=1}^{n-1} H_k^{(2)} = nH_n^{(2)} - H_n$

(7) $\sum_{k=1}^{n-1} H_k H_{n-k} = (n+1)\big(H_{n+1}^2 - H_{n+1}^{(2)} - 2H_{n+1} + 2\big)$

(8) $\sum_{k=1}^{n-1} kH_k H_{n-k} = \binom{n+1}{2}\big(H_{n+1}^2 - H_{n+1}^{(2)} - 2H_{n+1} + 2\big)$

Identities (1)-(6) are well known and easily proved by induction on $n$: see, for instance,
the chapters on harmonic numbers in Knuth's classical textbooks [6, §6.3, 6.4] and [9,
§1.2.7]. Identities (7) and (8) are proved in [20, Thms. 1, 2].
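
Since these identities drive every computation that follows, it may be worth checking them numerically. The sketch below does so in exact arithmetic for small $n$; the function names are ours, and the eight identities are coded in the form given in the comments, which is our reading of the lemma (in particular, identity (4) is taken to be the sum of $H_k/(k+1)$ and identity (5) to end in $+2n$).

```python
from fractions import Fraction
from math import comb


def harmonic(n, power=1):
    """Generalized harmonic number: sum of 1/i**power for i = 1..n (exact)."""
    return sum(Fraction(1, i ** power) for i in range(1, n + 1))


def lemma2_holds(n):
    """Return True when the eight harmonic-number identities hold for this n >= 2."""
    H = [harmonic(k) for k in range(n + 2)]      # H[k] = H_k, with H_0 = 0
    H2 = [harmonic(k, 2) for k in range(n + 2)]  # H2[k] = H_k^(2)
    ks = range(1, n)                             # summation index k = 1..n-1
    conv = H[n + 1] ** 2 - H2[n + 1] - 2 * H[n + 1] + 2
    checks = [
        # (1) sum H_k = n(H_n - 1)
        sum(H[k] for k in ks) == n * (H[n] - 1),
        # (2) sum k H_k = n(n-1)(2H_n - 1)/4
        sum(k * H[k] for k in ks) == Fraction(n * (n - 1), 4) * (2 * H[n] - 1),
        # (3) sum k^2 H_k = n(n-1)((12n-6)H_n - 4n - 1)/36
        sum(k * k * H[k] for k in ks)
        == Fraction(n * (n - 1), 36) * ((12 * n - 6) * H[n] - 4 * n - 1),
        # (4) sum H_k/(k+1) = (H_n^2 - H_n^(2))/2
        sum(H[k] / (k + 1) for k in ks) == (H[n] ** 2 - H2[n]) / 2,
        # (5) sum H_k^2 = n H_n^2 - (2n+1) H_n + 2n
        sum(H[k] ** 2 for k in ks) == n * H[n] ** 2 - (2 * n + 1) * H[n] + 2 * n,
        # (6) sum H_k^(2) = n H_n^(2) - H_n
        sum(H2[k] for k in ks) == n * H2[n] - H[n],
        # (7) sum H_k H_{n-k} = (n+1)(H_{n+1}^2 - H_{n+1}^(2) - 2H_{n+1} + 2)
        sum(H[k] * H[n - k] for k in ks) == (n + 1) * conv,
        # (8) sum k H_k H_{n-k} = C(n+1,2)(H_{n+1}^2 - H_{n+1}^(2) - 2H_{n+1} + 2)
        sum(k * H[k] * H[n - k] for k in ks) == comb(n + 1, 2) * conv,
    ]
    return all(checks)


if __name__ == "__main__":
    print(all(lemma2_holds(n) for n in range(2, 31)))
```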
3 Recursive shape indices



A recursive shape index for phylogenetic trees [10] is a mapping $I$ that associates to each
phylogenetic tree $T$ a real number $I(T) \in \mathbb{R}$ satisfying the following two conditions:

(a) It is invariant under tree isomorphisms and relabelings of leaves.

(b) There exists a symmetric mapping $f_I : \mathbb{N} \times \mathbb{N} \to \mathbb{R}$ such that, for every pair of phylogenetic
trees $T, T'$ on disjoint sets of taxa $S, S'$, respectively,

\[ I(T\,\widehat{\ }\,T') = I(T) + I(T') + f_I(|S|, |S'|). \]

As we shall see in later sections, the balance indices considered in this paper are
recursive shape indices. The following two results extract a common part of the computation
of their variances. In them, and henceforth, $E_Y$ applied to a random variable will
mean the expected value of this random variable under the Yule model.



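Lemma 3. Let $I$ be a recursive shape index for phylogenetic trees. For every $n \geq 1$, let
$I_n$ be the random variable that chooses a tree $T \in \mathcal{T}_n$ and computes $I(T)$. Then,

\[ E_Y(I_n^2) = \frac{1}{n-1} \sum_{k=1}^{n-1} \Big( 2E_Y(I_k^2) + 4f_I(k,n-k)E_Y(I_k) + 2E_Y(I_k) \cdot E_Y(I_{n-k}) + f_I(k,n-k)^2 \Big). \]

Proof. We compute $E_Y(I_n^2)$ using its very definition and Lemma 1. Recall that every tree
in $\mathcal{T}_n$ is obtained twice as $T_k\,\widehat{\ }\,T'_{n-k}$, for some $1 \leq k \leq n-1$, some subset $S_k \subsetneq \{1, \ldots, n\}$
with $k$ elements, some tree $T_k$ on $S_k$ and some tree $T'_{n-k}$ on $S_k^c$, and that, by the definition
of recursive shape index,

\[ I(T_k\,\widehat{\ }\,T'_{n-k}) = I(T_k) + I(T'_{n-k}) + f_I(k, n-k). \]

Since $I$ and $P_Y$ are invariant under relabelings of leaves, the sums over $T_k$ and $T'_{n-k}$ below do not
depend on the subset $S_k$, and there are $\binom{n}{k}$ such subsets. Then,

\[
\begin{aligned}
E_Y(I_n^2) &= \sum_{T \in \mathcal{T}_n} I(T)^2 \cdot P_Y(T) \\
&= \frac{1}{2} \sum_{k=1}^{n-1} \sum_{S_k} \sum_{T_k} \sum_{T'_{n-k}} \big(I(T_k) + I(T'_{n-k}) + f_I(k,n-k)\big)^2 \cdot \frac{2}{(n-1)\binom{n}{k}} P_Y(T_k) P_Y(T'_{n-k}) \\
&= \frac{1}{n-1} \sum_{k=1}^{n-1} \sum_{T_k} \sum_{T'_{n-k}} \big( I(T_k)^2 + I(T'_{n-k})^2 + f_I(k,n-k)^2 + 2I(T_k)I(T'_{n-k}) \\
&\qquad\qquad + 2f_I(k,n-k)I(T_k) + 2f_I(k,n-k)I(T'_{n-k}) \big) P_Y(T_k) P_Y(T'_{n-k}) \\
&= \frac{1}{n-1} \sum_{k=1}^{n-1} \Big( \sum_{T_k} I(T_k)^2 P_Y(T_k) + \sum_{T'_{n-k}} I(T'_{n-k})^2 P_Y(T'_{n-k}) + f_I(k,n-k)^2 \\
&\qquad\qquad + 2\sum_{T_k} f_I(k,n-k) I(T_k) P_Y(T_k) + 2\sum_{T'_{n-k}} f_I(k,n-k) I(T'_{n-k}) P_Y(T'_{n-k}) \\
&\qquad\qquad + 2\Big(\sum_{T_k} I(T_k) P_Y(T_k)\Big)\Big(\sum_{T'_{n-k}} I(T'_{n-k}) P_Y(T'_{n-k})\Big) \Big) \\
&= \frac{1}{n-1} \sum_{k=1}^{n-1} \Big( E_Y(I_k^2) + E_Y(I_{n-k}^2) + f_I(k,n-k)^2 \\
&\qquad\qquad + 2f_I(k,n-k)\big(E_Y(I_k) + E_Y(I_{n-k})\big) + 2E_Y(I_k) \cdot E_Y(I_{n-k}) \Big) \\
&= \frac{1}{n-1} \sum_{k=1}^{n-1} \Big( 2E_Y(I_k^2) + 4f_I(k,n-k)E_Y(I_k) + 2E_Y(I_k) \cdot E_Y(I_{n-k}) + f_I(k,n-k)^2 \Big)
\end{aligned}
\]

as we claimed. □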
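
The recursion of Lemma 3 is easy to run mechanically. The following sketch, with function names of our own choosing, iterates it for the Sackin index, using $f_S(k, n-k) = n$ and the known mean $E_Y(S_n) = 2n(H_n - 1)$, and compares the result with the closed formula $4n^2(H_n^2 - H_n^{(2)} - 2H_n) - 2nH_n + 11n^2 - n$ for $E_Y(S_n^2)$. It assumes that the index vanishes on the one-leaf tree.

```python
from fractions import Fraction


def harmonic(n, power=1):
    """Generalized harmonic number: sum of 1/i**power for i = 1..n (exact)."""
    return sum(Fraction(1, i ** power) for i in range(1, n + 1))


def second_moments(n_max, f, mean):
    """E_Y(I_n^2) for n = 1..n_max via the recursion of Lemma 3.

    f(a, b) is the symmetric function f_I and mean(n) returns E_Y(I_n).
    Assumes I is 0 on the one-leaf tree, so that E_Y(I_1^2) = 0.
    """
    m2 = {1: Fraction(0)}
    for n in range(2, n_max + 1):
        total = Fraction(0)
        for k in range(1, n):
            total += (2 * m2[k]
                      + 4 * f(k, n - k) * mean(k)
                      + 2 * mean(k) * mean(n - k)
                      + f(k, n - k) ** 2)
        m2[n] = total / (n - 1)
    return m2


def sackin_mean(n):
    """E_Y(S_n) = 2n(H_n - 1)."""
    return 2 * n * (harmonic(n) - 1)


def sackin_second_moment_closed(n):
    """Closed formula for E_Y(S_n^2)."""
    h, h2 = harmonic(n), harmonic(n, 2)
    return 4 * n * n * (h * h - h2 - 2 * h) - 2 * n * h + 11 * n * n - n


# For the Sackin index, f_S(a, b) = a + b (the total number of leaves).
m2 = second_moments(12, lambda a, b: a + b, sackin_mean)
agree = all(m2[n] == sackin_second_moment_closed(n) for n in range(1, 13))
print(agree)
```

Passing a different `f` and `mean` (for instance those of the total cophenetic index) reuses the same recursion unchanged, which is the point of isolating it in a lemma.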

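Corollary 1. Let $I$ be a recursive shape index for phylogenetic trees and, for every $n \geq 1$,
let $I_n$ be the random variable that chooses a tree $T \in \mathcal{T}_n$ and computes $I(T)$. Set

\[
\begin{aligned}
\varepsilon_I(a, b-1) &= f_I(a,b) - f_I(a,b-1) && \text{for every } a \geq 1 \text{ and } b \geq 2 \\
R_I(n-1) &= E_Y(I_n) - E_Y(I_{n-1}) && \text{for every } n \geq 2
\end{aligned}
\]

If $E_Y(I_1) = 0$, then

\[
\begin{aligned}
E_Y(I_n^2) &= \frac{n}{n-1} E_Y(I_{n-1}^2) + \frac{4}{n-1} \sum_{k=1}^{n-2} \varepsilon_I(k, n-1-k) \cdot E_Y(I_k) \\
&\quad + \frac{4}{n-1} f_I(n-1,1) E_Y(I_{n-1}) + \frac{2}{n-1} \sum_{k=1}^{n-2} E_Y(I_k) R_I(n-k-1) \\
&\quad + \frac{f_I(n-1,1)^2}{n-1} + \frac{1}{n-1} \sum_{k=1}^{n-2} \big( f_I(k,n-k)^2 - f_I(k,n-k-1)^2 \big).
\end{aligned}
\]

Proof. By Lemma 3,

\[
\begin{aligned}
E_Y(I_n^2) &= \frac{2}{n-1} \sum_{k=1}^{n-1} E_Y(I_k^2) + \frac{4}{n-1} \sum_{k=1}^{n-1} f_I(k,n-k) E_Y(I_k) \\
&\quad + \frac{2}{n-1} \sum_{k=1}^{n-1} E_Y(I_k) E_Y(I_{n-k}) + \frac{1}{n-1} \sum_{k=1}^{n-1} f_I(k,n-k)^2
\end{aligned}
\]

and in particular

\[
\begin{aligned}
E_Y(I_{n-1}^2) &= \frac{2}{n-2} \sum_{k=1}^{n-2} E_Y(I_k^2) + \frac{4}{n-2} \sum_{k=1}^{n-2} f_I(k,n-1-k) E_Y(I_k) \\
&\quad + \frac{2}{n-2} \sum_{k=1}^{n-2} E_Y(I_k) E_Y(I_{n-1-k}) + \frac{1}{n-2} \sum_{k=1}^{n-2} f_I(k,n-k-1)^2.
\end{aligned}
\]

Therefore, writing $f_I(k,n-k) = f_I(k,n-1-k) + \varepsilon_I(k,n-1-k)$ and $E_Y(I_{n-k}) = E_Y(I_{n-1-k}) + R_I(n-k-1)$
for $1 \leq k \leq n-2$,

\[
\begin{aligned}
E_Y(I_n^2) &= \frac{n-2}{n-1} \cdot \frac{2}{n-2} \sum_{k=1}^{n-2} E_Y(I_k^2) + \frac{2}{n-1} E_Y(I_{n-1}^2) \\
&\quad + \frac{n-2}{n-1} \cdot \frac{4}{n-2} \sum_{k=1}^{n-2} E_Y(I_k)\big(f_I(k,n-1-k) + \varepsilon_I(k,n-1-k)\big) + \frac{4}{n-1} f_I(n-1,1) E_Y(I_{n-1}) \\
&\quad + \frac{n-2}{n-1} \cdot \frac{2}{n-2} \sum_{k=1}^{n-2} E_Y(I_k)\big(E_Y(I_{n-1-k}) + R_I(n-k-1)\big) + \frac{2}{n-1} E_Y(I_{n-1}) E_Y(I_1) \\
&\quad + \frac{n-2}{n-1} \cdot \frac{1}{n-2} \sum_{k=1}^{n-2} f_I(k,n-k-1)^2 + \frac{1}{n-1} \sum_{k=1}^{n-1} f_I(k,n-k)^2 - \frac{n-2}{n-1} \cdot \frac{1}{n-2} \sum_{k=1}^{n-2} f_I(k,n-k-1)^2 \\
&= \frac{n-2}{n-1} E_Y(I_{n-1}^2) + \frac{2}{n-1} E_Y(I_{n-1}^2) + \frac{4}{n-1} \sum_{k=1}^{n-2} \varepsilon_I(k,n-1-k) \cdot E_Y(I_k) \\
&\quad + \frac{4}{n-1} f_I(n-1,1) E_Y(I_{n-1}) + \frac{2}{n-1} \sum_{k=1}^{n-2} E_Y(I_k) R_I(n-k-1) \\
&\quad + \frac{1}{n-1} \sum_{k=1}^{n-1} f_I(k,n-k)^2 - \frac{1}{n-1} \sum_{k=1}^{n-2} f_I(k,n-k-1)^2 \\
&= \frac{n}{n-1} E_Y(I_{n-1}^2) + \frac{4}{n-1} \sum_{k=1}^{n-2} \varepsilon_I(k,n-1-k) \cdot E_Y(I_k) \\
&\quad + \frac{4}{n-1} f_I(n-1,1) E_Y(I_{n-1}) + \frac{2}{n-1} \sum_{k=1}^{n-2} E_Y(I_k) R_I(n-k-1) \\
&\quad + \frac{1}{n-1} \sum_{k=1}^{n-2} \big( f_I(k,n-k)^2 - f_I(k,n-k-1)^2 \big) + \frac{1}{n-1} \cdot f_I(n-1,1)^2.
\end{aligned}
\]

□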



4 The variance of Sackin's index 

The Sackin index of a phylogenetic tree $T \in \mathcal{T}_n$ is defined as the sum of the depths of
its leaves:

\[ S(T) = \sum_{i=1}^{n} \delta_T(i). \]

It is well known (cf. [13, Eq. (6)]) that if $T_k \in \mathcal{T}(S_k)$, for some $\emptyset \neq S_k \subsetneq \{1, \ldots, n\}$, and
$T'_{n-k} \in \mathcal{T}(S_k^c)$, then

\[ S(T_k\,\widehat{\ }\,T'_{n-k}) = S(T_k) + S(T'_{n-k}) + n. \]

Let $S_n$ be the random variable that chooses a tree $T \in \mathcal{T}_n$ and computes $S(T)$. Its
expected value under the Yule model is [8]

\[ E_Y(S_n) = 2n(H_n - 1). \]

In particular $E_Y(S_1) = 0$. Notice that $E_Y(S_n)$ satisfies the recurrence

\[ E_Y(S_{n+1}) = E_Y(S_n) + 2H_n. \]

Indeed,

\[
\begin{aligned}
E_Y(S_{n+1}) - E_Y(S_n) &= 2(n+1)(H_{n+1} - 1) - 2n(H_n - 1) \\
&= 2(n+1)\Big(H_n + \frac{1}{n+1} - 1\Big) - 2n(H_n - 1) = 2H_n.
\end{aligned}
\]

In this section we compute the variance of $S_n$ under this model.

Theorem 1. $E_Y(S_n^2) = 4n^2(H_n^2 - H_n^{(2)} - 2H_n) - 2nH_n + 11n^2 - n$.

Proof. As we have seen, Sackin's index satisfies the hypotheses in Corollary 1 with
$f_S(k, n-k) = n$, and hence $\varepsilon_S(k, n-k-1) = 1$, and $R_S(k) = 2H_k$. Therefore

\[
\begin{aligned}
E_Y(S_n^2) &= \frac{n}{n-1} E_Y(S_{n-1}^2) + \frac{4}{n-1} \sum_{k=1}^{n-2} E_Y(S_k) + \frac{4}{n-1}\, n\, E_Y(S_{n-1}) + \frac{2}{n-1} \sum_{k=1}^{n-2} E_Y(S_k)\, 2H_{n-k-1} \\
&\quad + \frac{n^2}{n-1} + \frac{1}{n-1} \sum_{k=1}^{n-2} \big(n^2 - (n-1)^2\big) \\
&= \frac{n}{n-1} E_Y(S_{n-1}^2) + \frac{4}{n-1} \sum_{k=1}^{n-2} E_Y(S_k) + 8n(H_{n-1} - 1) + \frac{4}{n-1} \sum_{k=1}^{n-2} E_Y(S_k) H_{n-k-1} + 3n - 2.
\end{aligned}
\]

Now, by Lemma 2,

\[
\begin{aligned}
\frac{4}{n-1} \sum_{k=1}^{n-2} E_Y(S_k) &= \frac{8}{n-1} \sum_{k=1}^{n-2} kH_k - \frac{8}{n-1} \sum_{k=1}^{n-2} k \\
&= \frac{8}{n-1} \cdot \frac{1}{4}(n-1)(n-2)(2H_{n-1} - 1) - \frac{8}{n-1} \cdot \frac{1}{2}(n-1)(n-2) \\
&= 2(n-2)(2H_{n-1} - 3)
\end{aligned}
\]

and

\[
\begin{aligned}
\frac{4}{n-1} \sum_{k=1}^{n-2} E_Y(S_k) H_{n-k-1} &= \frac{8}{n-1} \sum_{k=1}^{n-2} k(H_k - 1) H_{n-k-1} \\
&= \frac{8}{n-1} \sum_{k=1}^{n-2} kH_k H_{n-k-1} - \frac{8}{n-1} \sum_{k=1}^{n-2} kH_{n-k-1} \\
&= \frac{8}{n-1} \sum_{k=1}^{n-2} kH_k H_{n-k-1} - \frac{8}{n-1} \sum_{k=1}^{n-2} (n-k-1)H_k \\
&= 4n(H_n^2 - H_n^{(2)} - 2H_n + 2) - 8\sum_{k=1}^{n-2} H_k + \frac{8}{n-1} \sum_{k=1}^{n-2} kH_k \\
&= 4n(H_n^2 - H_n^{(2)} - 2H_n + 2) - 8(n-1)(H_{n-1} - 1) + 2(n-2)(2H_{n-1} - 1) \\
&= 4n\Big(H_n^2 - H_n^{(2)} - 2H_{n-1} - 2 \cdot \frac{1}{n} + 2\Big) - 4nH_{n-1} + 6n - 4 \\
&= 4n(H_n^2 - H_n^{(2)} - 3H_{n-1}) + 14n - 12.
\end{aligned}
\]

And thus

\[
\begin{aligned}
E_Y(S_n^2) &= \frac{n}{n-1} E_Y(S_{n-1}^2) + 2(n-2)(2H_{n-1} - 3) + 8n(H_{n-1} - 1) \\
&\quad + 4n(H_n^2 - H_n^{(2)} - 3H_{n-1}) + 14n - 12 + 3n - 2 \\
&= \frac{n}{n-1} E_Y(S_{n-1}^2) + 4n(H_n^2 - H_n^{(2)}) - 8H_{n-1} + 3n - 2.
\end{aligned}
\]

Setting $x_n = E_Y(S_n^2)/n$, this equation becomes

\[ x_n = x_{n-1} + 4(H_n^2 - H_n^{(2)}) - \frac{8H_{n-1}}{n} + 3 - \frac{2}{n}. \]

The solution of this equation with $x_1 = 0$ is

\[
\begin{aligned}
x_n &= \sum_{k=2}^{n} \Big( 4(H_k^2 - H_k^{(2)}) - \frac{8H_{k-1}}{k} + 3 - \frac{2}{k} \Big) \\
&= 4\sum_{k=2}^{n} (H_k^2 - H_k^{(2)}) - 8\sum_{k=1}^{n-1} \frac{H_k}{k+1} + 3(n-1) - 2\sum_{k=2}^{n} \frac{1}{k} \\
&= 4\sum_{k=2}^{n} (H_k^2 - H_k^{(2)}) - 4(H_n^2 - H_n^{(2)}) + 3(n-1) - 2(H_n - 1) \\
&= 4\sum_{k=2}^{n-1} (H_k^2 - H_k^{(2)}) - 2H_n + 3n - 1 \\
&= 4\big(nH_n^2 - (2n+1)H_n + 2n - nH_n^{(2)} + H_n\big) - 2H_n + 3n - 1 \\
&= 4n(H_n^2 - H_n^{(2)} - 2H_n) - 2H_n + 11n - 1
\end{aligned}
\]

and therefore

\[ E_Y(S_n^2) = nx_n = 4n^2(H_n^2 - H_n^{(2)} - 2H_n) - 2nH_n + 11n^2 - n \]

as we claimed. □
Corollary 2. The variance of $S_n$ under the Yule model is

\[ \sigma_Y^2(S_n) = 7n^2 - 4n^2 H_n^{(2)} - 2nH_n - n. \]

Proof. It is obtained by replacing in the formula $\sigma_Y^2(S_n) = E_Y(S_n^2) - E_Y(S_n)^2$ the value
of $E_Y(S_n^2)$ obtained in the last theorem and $E_Y(S_n) = 2n(H_n - 1)$. □

From this exact formula we can obtain an $O(1/n)$ approximation of $\sigma_Y^2(S_n)$, which
refines the limit formula obtained in [1].

Corollary 3. $\sigma_Y^2(S_n) = \big(7 - \frac{2\pi^2}{3}\big)n^2 + n(3 - 2\ln(n) - 2\gamma) - 3 + O\big(\frac{1}{n}\big)$.
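
To get a feeling for the quality of this approximation, the sketch below compares the exact variance $7n^2 - 4n^2H_n^{(2)} - 2nH_n - n$ with the asymptotic expression $(7 - \frac{2\pi^2}{3})n^2 + n(3 - 2\ln(n) - 2\gamma) - 3$ in floating point. The function names are ours, and Euler's constant is hard-coded as an assumption of the example.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant, hard-coded for the example


def sackin_variance_exact(n):
    """Exact variance of S_n under the Yule model, evaluated in floating point."""
    h = math.fsum(1.0 / i for i in range(1, n + 1))
    h2 = math.fsum(1.0 / (i * i) for i in range(1, n + 1))
    return 7 * n * n - 4 * n * n * h2 - 2 * n * h - n


def sackin_variance_approx(n):
    """Asymptotic expression for the variance of S_n, dropping the O(1/n) term."""
    return ((7 - 2 * math.pi ** 2 / 3) * n * n
            + n * (3 - 2 * math.log(n) - 2 * EULER_GAMMA) - 3)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        exact = sackin_variance_exact(n)
        approx = sackin_variance_approx(n)
        print(n, exact, approx, exact - approx)
```

The printed differences shrink roughly in proportion to $1/n$, as the error term predicts.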
5 The variance of the total cophenetic index $\Phi$

The total cophenetic index of a phylogenetic tree $T \in \mathcal{T}_n$ is defined as the sum of the
cophenetic values of its pairs of leaves:

\[ \Phi(T) = \sum_{1 \leq i < j \leq n} \varphi_T(i,j). \]

By [11, Lem. 4], if $T_k \in \mathcal{T}(S_k)$, for some $\emptyset \neq S_k \subsetneq \{1, \ldots, n\}$ with $k$ elements, and
$T'_{n-k} \in \mathcal{T}(S_k^c)$, then

\[ \Phi(T_k\,\widehat{\ }\,T'_{n-k}) = \Phi(T_k) + \Phi(T'_{n-k}) + \binom{k}{2} + \binom{n-k}{2}. \]

Therefore, $\Phi$ is a recursive shape index with $f_\Phi(k, n-k) = \binom{k}{2} + \binom{n-k}{2}$, and in particular
$\varepsilon_\Phi(k, n-k-1) = n-k-1$.

Let $\Phi_n$ be the random variable that chooses a tree $T \in \mathcal{T}_n$ and computes its total
cophenetic index $\Phi(T)$. The expected value under the Yule model of $\Phi_n$ is [11]

\[ E_Y(\Phi_n) = n(n-1) - 2n(H_n - 1) = n(n + 1 - 2H_n). \]

In particular, $E_Y(\Phi_1) = 0$. This expected value satisfies the recurrence

\[ E_Y(\Phi_n) = E_Y(\Phi_{n-1}) + 2(n - 1 - H_{n-1}), \]

and therefore $R_\Phi(k) = 2(k - H_k)$.

In this section we compute the variance of $\Phi_n$ under the Yule model.

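Theorem 2. $E_Y(\Phi_n^2) = 4n^2(H_n^2 - H_n^{(2)}) - 2(2n^3 + 2n^2 + 3n)H_n + \frac{1}{12}(13n^4 + 14n^3 + 143n^2 - 2n)$.

Proof. $\Phi$ satisfies the hypotheses of Corollary 1 with

\[ f_\Phi(k, n-k) = \binom{k}{2} + \binom{n-k}{2}, \quad \varepsilon_\Phi(k, n-k-1) = n-k-1, \quad R_\Phi(k) = 2(k - H_k). \]

Therefore, by the aforementioned result,

\[
\begin{aligned}
E_Y(\Phi_n^2) &= \frac{n}{n-1} E_Y(\Phi_{n-1}^2) + \frac{4}{n-1} \sum_{k=1}^{n-2} (n-k-1) E_Y(\Phi_k) + \frac{4}{n-1} \binom{n-1}{2} E_Y(\Phi_{n-1}) \\
&\quad + \frac{2}{n-1} \sum_{k=1}^{n-2} 2E_Y(\Phi_k)\big((n-k-1) - H_{n-k-1}\big) + \frac{1}{n-1} \binom{n-1}{2}^2 \\
&\quad + \frac{1}{n-1} \sum_{k=1}^{n-2} \Big( \Big(\binom{k}{2} + \binom{n-k}{2}\Big)^2 - \Big(\binom{k}{2} + \binom{n-k-1}{2}\Big)^2 \Big) \\
&= \frac{n}{n-1} E_Y(\Phi_{n-1}^2) + \frac{4}{n-1} \sum_{k=1}^{n-2} 2(n-k-1)(k^2 + k - 2kH_k) + 2(n-2)(n-1)(n - 2H_{n-1}) \\
&\quad - \frac{4}{n-1} \sum_{k=1}^{n-2} H_{n-k-1}(k^2 + k - 2kH_k) + \frac{1}{12}(n-2)(7n^2 - 21n + 12) \\
&= \frac{n}{n-1} E_Y(\Phi_{n-1}^2) - 4(n-2)(n-1)H_{n-1} - 16\sum_{k=1}^{n-2} kH_k + \frac{16}{n-1} \sum_{k=1}^{n-2} k^2 H_k \\
&\quad - \frac{4}{n-1} \sum_{k=1}^{n-2} k^2 H_{n-k-1} - \frac{4}{n-1} \sum_{k=1}^{n-2} kH_{n-k-1} + \frac{8}{n-1} \sum_{k=1}^{n-2} kH_k H_{n-k-1} \\
&\quad + \frac{1}{12}(n-2)(39n^2 - 37n + 12) \\
&= \frac{n}{n-1} E_Y(\Phi_{n-1}^2) - 4(n-2)(n-1)H_{n-1} - 16\sum_{k=1}^{n-2} kH_k + \frac{16}{n-1} \sum_{k=1}^{n-2} k^2 H_k \\
&\quad - \frac{4}{n-1} \sum_{k=1}^{n-2} (n-k-1)^2 H_k - \frac{4}{n-1} \sum_{k=1}^{n-2} (n-k-1) H_k + \frac{8}{n-1} \sum_{k=1}^{n-2} kH_k H_{n-k-1} \\
&\quad + \frac{1}{12}(n-2)(39n^2 - 37n + 12) \\
&= \frac{n}{n-1} E_Y(\Phi_{n-1}^2) - 4(n-2)(n-1)H_{n-1} - 16\sum_{k=1}^{n-2} kH_k + \frac{16}{n-1} \sum_{k=1}^{n-2} k^2 H_k + \frac{8}{n-1} \sum_{k=1}^{n-2} kH_k H_{n-k-1} \\
&\quad - 4(n-1)\sum_{k=1}^{n-2} H_k + 8\sum_{k=1}^{n-2} kH_k - \frac{4}{n-1} \sum_{k=1}^{n-2} k^2 H_k - 4\sum_{k=1}^{n-2} H_k + \frac{4}{n-1} \sum_{k=1}^{n-2} kH_k \\
&\quad + \frac{1}{12}(n-2)(39n^2 - 37n + 12) \\
&= \frac{n}{n-1} E_Y(\Phi_{n-1}^2) - 4(n-2)(n-1)H_{n-1} + \frac{12}{n-1} \sum_{k=1}^{n-2} k^2 H_k + \frac{12 - 8n}{n-1} \sum_{k=1}^{n-2} kH_k \\
&\quad - 4n\sum_{k=1}^{n-2} H_k + \frac{8}{n-1} \sum_{k=1}^{n-2} kH_k H_{n-k-1} + \frac{1}{12}(n-2)(39n^2 - 37n + 12) \\
&= \frac{n}{n-1} E_Y(\Phi_{n-1}^2) - 4(n-2)(n-1)H_{n-1} \\
&\quad + \frac{12}{n-1} \cdot \frac{1}{36}(n-1)(n-2)\big((12n - 18)H_{n-1} - 4n + 3\big) \\
&\quad + \frac{12 - 8n}{n-1} \cdot \frac{1}{4}(n-1)(n-2)(2H_{n-1} - 1) - 4n(n-1)(H_{n-1} - 1) \\
&\quad + 4n(H_n^2 - H_n^{(2)} - 2H_n + 2) + \frac{1}{12}(n-2)(39n^2 - 37n + 12) \\
&= \frac{n}{n-1} E_Y(\Phi_{n-1}^2) + 4n(H_n^2 - H_n^{(2)}) - 8nH_n - 8(n-1)^2 H_{n-1} + \frac{1}{12}(39n^3 - 59n^2 + 94n + 24) \\
&= \frac{n}{n-1} E_Y(\Phi_{n-1}^2) + 4n(H_n^2 - H_n^{(2)}) - 8(n^2 - n + 1)H_{n-1} + \frac{1}{12}(39n^3 - 59n^2 + 94n - 72).
\end{aligned}
\]

Setting $x_n = E_Y(\Phi_n^2)/n$, this equation becomes

\[ x_n = x_{n-1} + 4(H_n^2 - H_n^{(2)}) - 8\Big(n - 1 + \frac{1}{n}\Big)H_{n-1} + \frac{1}{12}\Big(39n^2 - 59n + 94 - \frac{72}{n}\Big). \]

The solution of this equation with $x_1 = 0$ is

\[
\begin{aligned}
x_n &= \sum_{k=2}^{n} \Big( 4(H_k^2 - H_k^{(2)}) - 8\Big(k - 1 + \frac{1}{k}\Big)H_{k-1} + \frac{1}{12}\Big(39k^2 - 59k + 94 - \frac{72}{k}\Big) \Big) \\
&= 4\sum_{k=2}^{n} H_k^2 - 4\sum_{k=2}^{n} H_k^{(2)} - 8\sum_{k=1}^{n-1} kH_k - 8\sum_{k=1}^{n-1} \frac{H_k}{k+1} + \frac{1}{12} \sum_{k=2}^{n} \Big(39k^2 - 59k + 94 - \frac{72}{k}\Big) \\
&= 4\sum_{k=2}^{n-1} H_k^2 - 4\sum_{k=2}^{n-1} H_k^{(2)} - 8\sum_{k=1}^{n-1} kH_k + \frac{1}{12} \sum_{k=2}^{n} \Big(39k^2 - 59k + 94 - \frac{72}{k}\Big) \\
&= 4n(H_n^2 - H_n^{(2)}) - 8nH_n + 8n - 2n(n-1)(2H_n - 1) - 6H_n + \frac{1}{12}(13n^3 - 10n^2 + 71n - 2) \\
&= 4n(H_n^2 - H_n^{(2)}) - 2(2n^2 + 2n + 3)H_n + \frac{1}{12}(13n^3 + 14n^2 + 143n - 2).
\end{aligned}
\]

Therefore

\[ E_Y(\Phi_n^2) = nx_n = 4n^2(H_n^2 - H_n^{(2)}) - 2(2n^3 + 2n^2 + 3n)H_n + \frac{1}{12}(13n^4 + 14n^3 + 143n^2 - 2n) \]

as we claimed. □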

Corollary 4. $\sigma_Y^2(\Phi_n) = \frac{1}{12}(n^4 - 10n^3 + 131n^2 - 2n) - 4n^2 H_n^{(2)} - 6nH_n$.

Proof. Simply replace in the formula $\sigma_Y^2(\Phi_n) = E_Y(\Phi_n^2) - E_Y(\Phi_n)^2$ the value of $E_Y(\Phi_n^2)$
obtained in the last theorem and the value of $E_Y(\Phi_n)$ recalled above. □

Corollary 5. $\sigma_Y^2(\Phi_n) = \frac{1}{12}n^4 - \frac{5}{6}n^3 + \big(\frac{131}{12} - \frac{2\pi^2}{3}\big)n^2 - 6n\ln(n) + \big(\frac{23}{6} - 6\gamma\big)n - 5 + O\big(\frac{1}{n}\big)$.



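6 The variance of $\overline{\Phi}$

For every $T \in \mathcal{T}_n$, let

\[ \overline{\Phi}(T) = S(T) + \Phi(T) = \sum_{1 \leq i \leq j \leq n} \varphi_T(i,j). \]

Lemma 4. If $T_k \in \mathcal{T}(S_k)$, with $\emptyset \neq S_k \subsetneq \{1, \ldots, n\}$ and $|S_k| = k$, and $T'_{n-k} \in \mathcal{T}(S_k^c)$,
then

\[ \overline{\Phi}(T_k\,\widehat{\ }\,T'_{n-k}) = \overline{\Phi}(T_k) + \overline{\Phi}(T'_{n-k}) + \binom{k+1}{2} + \binom{n-k+1}{2}. \]

Proof. Since $S(T_k\,\widehat{\ }\,T'_{n-k}) = S(T_k) + S(T'_{n-k}) + n$ and

\[ \Phi(T_k\,\widehat{\ }\,T'_{n-k}) = \Phi(T_k) + \Phi(T'_{n-k}) + \binom{k}{2} + \binom{n-k}{2}, \]

we have that

\[
\begin{aligned}
\overline{\Phi}(T_k\,\widehat{\ }\,T'_{n-k}) &= \overline{\Phi}(T_k) + \overline{\Phi}(T'_{n-k}) + \binom{k}{2} + \binom{n-k}{2} + n \\
&= \overline{\Phi}(T_k) + \overline{\Phi}(T'_{n-k}) + \binom{k+1}{2} + \binom{n-k+1}{2}.
\end{aligned}
\]

□

So, $\overline{\Phi}$ is a recursive shape index for phylogenetic trees with $f_{\overline{\Phi}}(a,b) = \binom{a+1}{2} + \binom{b+1}{2}$,
and hence $\varepsilon_{\overline{\Phi}}(a,b) = b + 1$.

Let $\overline{\Phi}_n$ be the random variable that chooses a tree $T \in \mathcal{T}_n$ and computes $\overline{\Phi}(T)$. Its
expected value under the Yule model is [11]

\[ E_Y(\overline{\Phi}_n) = n(n-1). \]

In particular, $E_Y(\overline{\Phi}_1) = 0$ and $R_{\overline{\Phi}}(k) = 2k$. In this section we compute the variance of $\overline{\Phi}_n$.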

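Theorem 3. $E_Y(\overline{\Phi}_n^2) = \frac{1}{12}(13n^4 - 30n^3 + 23n^2 - 6n)$.

Proof. $\overline{\Phi}$ is a recursive shape index for phylogenetic trees with

\[ \varepsilon_{\overline{\Phi}}(k, n-k-1) = n-k, \quad R_{\overline{\Phi}}(k) = 2k. \]

Then, by Corollary 1,

\[
\begin{aligned}
E_Y(\overline{\Phi}_n^2) &= \frac{n}{n-1} E_Y(\overline{\Phi}_{n-1}^2) + \frac{4}{n-1} \sum_{k=1}^{n-2} (n-k) E_Y(\overline{\Phi}_k) + \frac{4}{n-1} \Big(\binom{n}{2} + 1\Big) E_Y(\overline{\Phi}_{n-1}) \\
&\quad + \frac{2}{n-1} \sum_{k=1}^{n-2} 2E_Y(\overline{\Phi}_k)(n-k-1) + \frac{1}{n-1} \Big(\binom{n}{2} + 1\Big)^2 \\
&\quad + \frac{1}{n-1} \sum_{k=1}^{n-2} \Big( \Big(\binom{k+1}{2} + \binom{n-k+1}{2}\Big)^2 - \Big(\binom{k+1}{2} + \binom{n-k}{2}\Big)^2 \Big) \\
&= \frac{n}{n-1} E_Y(\overline{\Phi}_{n-1}^2) + \frac{4}{n-1} \sum_{k=1}^{n-2} (n-k)k(k-1) + \frac{4}{n-1} \Big(\binom{n}{2} + 1\Big)(n-1)(n-2) \\
&\quad + \frac{4}{n-1} \sum_{k=1}^{n-2} k(k-1)(n-k-1) + \frac{1}{n-1} \Big(\binom{n}{2} + 1\Big)^2 \\
&\quad + \frac{1}{n-1} \sum_{k=1}^{n-2} (n-k)\big(k(k+1) + (n-k)^2\big) \\
&= \frac{n}{n-1} E_Y(\overline{\Phi}_{n-1}^2) + \frac{1}{4} n(13n^2 - 33n + 22).
\end{aligned}
\]

Setting $x_n = E_Y(\overline{\Phi}_n^2)/n$, this recurrence becomes

\[ x_n = x_{n-1} + \frac{1}{4}(13n^2 - 33n + 22) \]

and the solution with $x_1 = 0$ is

\[ x_n = \frac{1}{12}(13n^3 - 30n^2 + 23n - 6) \]

from where we deduce that

\[ E_Y(\overline{\Phi}_n^2) = nx_n = \frac{1}{12}(13n^4 - 30n^3 + 23n^2 - 6n) \]

as we claimed. □

Corollary 6. The variance of $\overline{\Phi}_n$ under the Yule model is

\[ \sigma_Y^2(\overline{\Phi}_n) = 2\binom{n}{4}. \]

Proof. Simply apply that $\sigma_Y^2(\overline{\Phi}_n) = E_Y(\overline{\Phi}_n^2) - E_Y(\overline{\Phi}_n)^2$. □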
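
For the reader's convenience, the one-line proof above can be spelled out as follows. This is only a worked version of the substitution, using the second moment $\frac{1}{12}(13n^4 - 30n^3 + 23n^2 - 6n)$ and the expected value $n(n-1)$ of $\overline{\Phi}_n$ under the Yule model.

```latex
\begin{aligned}
\sigma_Y^2(\overline{\Phi}_n)
  &= \frac{1}{12}\,(13n^4 - 30n^3 + 23n^2 - 6n) - n^2(n-1)^2 \\
  &= \frac{1}{12}\,\bigl(13n^4 - 30n^3 + 23n^2 - 6n - 12n^4 + 24n^3 - 12n^2\bigr) \\
  &= \frac{1}{12}\,(n^4 - 6n^3 + 11n^2 - 6n) \\
  &= \frac{n(n-1)(n-2)(n-3)}{12} = 2\binom{n}{4}.
\end{aligned}
```

In particular the variance vanishes for $n \leq 3$, in agreement with the fact that all trees with at most three leaves have the same shape.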
7 The covariances 

In this section we obtain the covariance under the Yule model of $S_n$ and $\Phi_n$ from the
formulas obtained in the previous sections for $E_Y(\overline{\Phi}_n^2)$, $E_Y(S_n^2)$ and $E_Y(\Phi_n^2)$.

Corollary 7. $\mathrm{cov}_Y(S_n, \Phi_n) = 4n(nH_n^{(2)} + H_n) + \frac{1}{6}n(n^2 - 51n + 2)$.

Proof. Recall that $\mathrm{cov}_Y(S_n, \Phi_n) = E_Y(S_n \cdot \Phi_n) - E_Y(S_n) \cdot E_Y(\Phi_n)$. Now

\[ E_Y(\overline{\Phi}_n^2) = E_Y((S_n + \Phi_n)^2) = E_Y(S_n^2) + E_Y(\Phi_n^2) + 2E_Y(S_n \cdot \Phi_n) \]

and therefore

\[ E_Y(S_n \cdot \Phi_n) = \frac{1}{2}\big(E_Y(\overline{\Phi}_n^2) - E_Y(S_n^2) - E_Y(\Phi_n^2)\big) \]

from where we obtain, replacing $E_Y(\overline{\Phi}_n^2)$, $E_Y(S_n^2)$ and $E_Y(\Phi_n^2)$ by their values obtained
in the previous sections, that

\[ E_Y(S_n \cdot \Phi_n) = 2n(n^2 + 3n + 2)H_n - 4n^2(H_n^2 - H_n^{(2)}) - \frac{11}{6}n^3 - \frac{21}{2}n^2 + \frac{1}{3}n. \]

Subtracting $E_Y(S_n) \cdot E_Y(\Phi_n) = 2n^2(H_n - 1)(n + 1 - 2H_n)$ from this expression, we finally
obtain the formula in the statement. □

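Corollary 8. $\mathrm{cov}_Y(S_n, \overline{\Phi}_n) = 2nH_n + \frac{1}{6}n(n^2 - 9n - 4)$.

Proof. By the bilinearity of covariances, $\mathrm{cov}_Y(S_n, \overline{\Phi}_n) = \mathrm{cov}_Y(S_n, S_n + \Phi_n) = \sigma_Y^2(S_n) +
\mathrm{cov}_Y(S_n, \Phi_n)$. □

Corollary 9.

\[
\begin{aligned}
\mathrm{cov}_Y(S_n, \Phi_n) &= \frac{1}{6}n^3 + \Big(\frac{2\pi^2}{3} - \frac{17}{2}\Big)n^2 + 4n\ln(n) + \frac{1}{3}(12\gamma - 11)n + 4 + O\Big(\frac{1}{n}\Big) \\
\mathrm{cov}_Y(S_n, \overline{\Phi}_n) &= \frac{1}{6}n^3 - \frac{3}{2}n^2 + 2n\ln(n) + \frac{1}{3}(6\gamma - 2)n + 1 + O\Big(\frac{1}{n}\Big)
\end{aligned}
\]

From the formulas for $\sigma_Y^2(S_n)$, $\sigma_Y^2(\Phi_n)$ and $\mathrm{cov}_Y(S_n, \Phi_n)$, we can compute Pearson's
correlation coefficient between $S_n$ and $\Phi_n$,

\[ \mathrm{cor}_Y(S_n, \Phi_n) = \frac{\mathrm{cov}_Y(S_n, \Phi_n)}{\sqrt{\sigma_Y^2(S_n)} \cdot \sqrt{\sigma_Y^2(\Phi_n)}}. \]

The exact formula for this coefficient is

\[ \frac{4n(nH_n^{(2)} + H_n) + \frac{1}{6}n(n^2 - 51n + 2)}{\sqrt{\big(7n^2 - 4n^2 H_n^{(2)} - 2nH_n - n\big)\big(\frac{1}{12}(n^4 - 10n^3 + 131n^2 - 2n) - 4n^2 H_n^{(2)} - 6nH_n\big)}} \]

and in the limit it is equal to

\[ \lim_{n \to \infty} \mathrm{cor}_Y(S_n, \Phi_n) = \frac{1/6}{\sqrt{\big(7 - \frac{2\pi^2}{3}\big) \cdot \frac{1}{12}}} = \frac{1}{\sqrt{21 - 2\pi^2}} \approx 0.89059. \]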
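
The convergence of the correlation coefficient towards its limit can be observed numerically. The sketch below, with a function name of our own, evaluates the exact expression for Pearson's coefficient in floating point from the closed formulas for the two variances and the covariance, and compares it with the limit value $1/\sqrt{21 - 2\pi^2}$.

```python
import math


def yule_correlation(n):
    """Pearson correlation of S_n and Phi_n under the Yule model (floating point)."""
    h = math.fsum(1.0 / i for i in range(1, n + 1))
    h2 = math.fsum(1.0 / (i * i) for i in range(1, n + 1))
    var_s = 7 * n**2 - 4 * n**2 * h2 - 2 * n * h - n
    var_phi = ((n**4 - 10 * n**3 + 131 * n**2 - 2 * n) / 12
               - 4 * n**2 * h2 - 6 * n * h)
    cov = 4 * n * (n * h2 + h) + n * (n**2 - 51 * n + 2) / 6
    return cov / math.sqrt(var_s * var_phi)


# Limit of the correlation coefficient as n tends to infinity.
limit = 1 / math.sqrt(21 - 2 * math.pi**2)

if __name__ == "__main__":
    for n in (6, 100, 10**4, 10**5):
        print(n, yule_correlation(n))
    print("limit", limit)
```

The coefficient is only defined for $n \geq 4$, since both variances vanish for smaller trees.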

8 Conclusions 

In this paper we have obtained exact formulas for the variance under the Yule model of
the Sackin index $S$, the total cophenetic index $\Phi$ and their sum $\overline{\Phi}$, and for the covariances
of $S$ with $\Phi$ and $\overline{\Phi}$. Unlike other expressions published so far in the literature, our formulas
are valid on spaces $\mathcal{T}_n$ of binary phylogenetic trees with any number $n$ of leaves, rather than
only asymptotically for large $n$, and they are explicit rather than recursive.

The proofs consist of elementary, although long and involved, algebraic computations.
Since it is easy to make a mistake in such long algebraic computations, to
double-check the results we have directly computed these variances and covariances on
$\mathcal{T}_n$ for $n = 3, \ldots, 9$ and confirmed that our formulas give the right results. The values
obtained are given in the next table. The Python scripts used to compute them are
available from the authors.



n                                   3    4          5          6          7           8           9
$\sigma_Y^2(S_n)$                   0    0.222222   0.805556   1.84       3.387778    5.49424     8.193827
$\sigma_Y^2(\Phi_n)$                0    0.888889   5.138889   17.04      42.787778   90.522812   170.350969
$\sigma_Y^2(\overline{\Phi}_n)$     0    2          10         30         70          140         252
$\mathrm{cov}_Y(S_n, \Phi_n)$       0    0.444444   2.027778   5.56       11.912222   21.991474   36.727602
$\mathrm{cor}_Y(S_n, \Phi_n)$       -    1          0.996639   0.992958   0.989408    0.986101    0.983053

Table 1. Values of $\sigma_Y^2(S_n)$, $\sigma_Y^2(\Phi_n)$, $\sigma_Y^2(\overline{\Phi}_n)$, $\mathrm{cov}_Y(S_n, \Phi_n)$, and $\mathrm{cor}_Y(S_n, \Phi_n)$ for $n = 3, \ldots, 9$. They
agree with the values given by our formulas.

It can be seen in this table that the values of the variance of $S_n$ are smaller than
those of the variance of $\Phi_n$ or $\overline{\Phi}_n$. Actually, as we have seen in the text, $\sigma_Y^2(S_n)$ has order
$O(n^2)$, while $\sigma_Y^2(\Phi_n)$ and $\sigma_Y^2(\overline{\Phi}_n)$ are $O(n^4)$. This is consistent with the fact that $\Phi$ and
$\overline{\Phi}$ have larger spans of values than $S$, $O(n^3)$ instead of $O(n^2)$, and far fewer ties. It is
also deduced from the formulas obtained in this paper, and from this table for small
values of $n$, that there is a strong direct linear correlation between $S_n$ and $\Phi_n$, although
in the limit Pearson's coefficient between them decreases to 0.89.
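
A direct computation of this kind can be sketched as follows. This is not the authors' script, only an illustration under our own assumptions: it builds the exact joint distribution of $(S, \Phi)$ on trees with $n$ leaves from the fact, implicit in Lemma 1, that under the Yule model the number $k$ of leaves of one of the two subtrees of the root is uniform on $\{1, \ldots, n-1\}$ and the two subtrees are independent Yule trees, together with the recursions $S = S_1 + S_2 + n$ and $\Phi = \Phi_1 + \Phi_2 + \binom{k}{2} + \binom{n-k}{2}$.

```python
from collections import defaultdict
from fractions import Fraction
from math import comb


def yule_joint_distributions(n_max):
    """dist[n] maps each pair (S, Phi) to its Yule probability on n-leaf trees."""
    dist = {1: {(0, 0): Fraction(1)}}
    for n in range(2, n_max + 1):
        acc = defaultdict(Fraction)
        for k in range(1, n):  # size of one root subtree, uniform on 1..n-1
            shift_phi = comb(k, 2) + comb(n - k, 2)
            for (s1, p1), q1 in dist[k].items():
                for (s2, p2), q2 in dist[n - k].items():
                    key = (s1 + s2 + n, p1 + p2 + shift_phi)
                    acc[key] += q1 * q2 / (n - 1)
        dist[n] = dict(acc)
    return dist


def moments(d):
    """Return (var S, var Phi, cov(S, Phi)) for a joint distribution d."""
    es = sum(q * s for (s, p), q in d.items())
    ep = sum(q * p for (s, p), q in d.items())
    var_s = sum(q * s * s for (s, p), q in d.items()) - es * es
    var_p = sum(q * p * p for (s, p), q in d.items()) - ep * ep
    cov = sum(q * s * p for (s, p), q in d.items()) - es * ep
    return var_s, var_p, cov


dist = yule_joint_distributions(9)

if __name__ == "__main__":
    for n in range(3, 10):
        var_s, var_p, cov = moments(dist[n])
        # The variance of the sum S + Phi follows from the other three values.
        print(n, float(var_s), float(var_p), float(var_s + var_p + 2 * cov), float(cov))
```

Working with `Fraction` keeps every probability exact, so the values can be compared with the closed formulas without rounding concerns.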

It remains to compute an exact formula for the variance of Colless' index $C$ and for
the covariances of $C$ with $S$, $\Phi$ and $\overline{\Phi}$. The variance can be obtained using again Corollary 1,
while the covariances can be obtained using a result similar to that corollary, giving a
recurrence for the expected value of the product of two recursive shape indices. In both 
cases, the computations are much longer and more involved than those included here. 
We shall report on them elsewhere. 

Acknowledgements 

The research reported in this paper has been partially supported by the Spanish government
and the EU FEDER program, through projects MTM2009-07165 and TIN2008-04487-E/TIN.
We thank J. Miro for several comments on a previous version of this
work.

References 

1. M. G. B. Blum, O. François, S. Janson, The mean, variance and limiting distribution of two statistics
sensitive to phylogenetic tree balance. Ann. Appl. Probab. 16 (2006), 2195-2214.

2. J. Brown, Probabilities of evolutionary trees. Syst. Biol. 43 (1994), 78-91.

3. L. L. Cavalli-Sforza, A. Edwards, Phylogenetic analysis. Models and estimation procedures. Am. J.
Hum. Genet. 19 (1967), 233-257.

4. D. H. Colless, Review of "Phylogenetics: the theory and practice of phylogenetic systematics". Sys.
Zool. 31 (1982), 100-104.

5. J. Felsenstein, Inferring Phylogenies. Sinauer Associates Inc. (2004).

6. R. Graham, D. Knuth, O. Patashnik, Concrete Mathematics. Addison-Wesley (1994).

7. E. Harding, The probabilities of rooted tree-shapes generated by random bifurcation. Adv. Appl.
Prob. 3 (1971), 44-77.

8. S. B. Heard, Patterns in Tree Balance among Cladistic, Phenetic, and Randomly Generated Phylogenetic
Trees. Evolution 46 (1992), 1818-1826.

9. D. Knuth, The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition).
Addison-Wesley (1997).

10. F. Matsen, Optimization Over a Class of Tree Shape Statistics. IEEE/ACM Trans. Comput. Biol.
Bioinformatics 4 (2007), 506-512.

11. A. Mir, F. Rosselló, L. Rotger, A new balance index for phylogenetic trees. arXiv:1202.1223v1
[q-bio.PE] (2012), submitted to Math. Biosc.

12. A. Mooers, S. B. Heard, Inferring evolutionary process from phylogenetic tree shape. Quart. Rev.
Biol. 72 (1997), 31-54.

13. J. S. Rogers, Central moments and probability distributions of three measures of phylogenetic tree
imbalance. Sys. Biol. 45 (1996), 99-110.

14. D. E. Rosen, Vicariant Patterns and Historical Explanation in Biogeography. Syst. Biol. 27 (1978),
159-188.

15. M. J. Sackin, "Good" and "bad" phenograms. Sys. Zool. 21 (1972), 225-226.

16. K. T. Shao, R. Sokal, Tree balance. Sys. Zool. 39 (1990), 226-276.

17. R. Sokal, F. Rohlf, The Comparison of Dendrograms by Objective Methods. Taxon 11 (1962), 33-40.

18. M. Steel, A. McKenzie, Distributions of cherries for two models of trees. Math. Biosc. 164 (2000),
81-92.

19. M. Steel, A. McKenzie, Properties of phylogenetic trees generated by Yule-type speciation models.
Math. Biosc. 170 (2001), 91-112.

20. C. Wei, D. Gong, Q. Wang, Chu-Vandermonde convolution and harmonic number identities.
arXiv:1201.0420v1 [math.CO] (2012).

21. G. U. Yule, A mathematical theory of evolution based on the conclusions of Dr J. C. Willis. Phil.
Trans. Royal Soc. (London) Series B 213 (1924), 21-87.